annotate_my_genomes: an easy-to-use pipeline to improve genome annotation and uncover neglected genes by hybrid RNA sequencing

Abstract Background The advancement of hybrid sequencing technologies is increasingly expanding genome assemblies that are often annotated using hybrid sequencing transcriptomics, leading to improved genome characterization and the identification of novel genes and isoforms in a wide variety of organisms. Results We developed an easy-to-use genome-guided transcriptome annotation pipeline that uses assembled transcripts from hybrid sequencing data as input and distinguishes between coding and long non-coding RNAs by integration of several bioinformatic approaches, including gene reconciliation with previous annotations in GTF format. We demonstrated the efficiency of this approach by correctly assembling and annotating all exons from the chicken SCO-spondin gene (containing more than 105 exons), including the identification of missing genes in the chicken reference annotations by homology assignments. Conclusions Our method helps to improve the current transcriptome annotation of the chicken brain. Our pipeline, implemented on Anaconda/Nextflow and Docker is an easy-to-use package that can be applied to a broad range of species, tissues, and research areas helping to improve and reconcile current annotations. The code and datasets are publicly available at https://github.com/cfarkas/annotate_my_genomes

The emergent advancement of Next Generation Sequencing (NGS) combined with 53 novel genome assembly methods greatly improved genome characterization, identifying novel 54 genes and isoforms in both model as well as non-model organisms [1][2][3]. RNA-sequencing 55 (RNA-seq) based on short reads resolve transcriptomes in a limited manner due to technical 56 limitations in assembly [4]. Long-read RNA-seq technologies alone or combined with short-57 read sequencing often improve the quality and contiguity of transcriptome assemblies [5,6]. 58 Long-read technologies such as PacBio single-molecule real-time (SMRT) and Oxford 59 Nanopore (ONT) sequencing technologies (hereafter PacBio and Nanopore sequencing, 60 respectively) are more efficient than short-read RNA-seq to reconstruct full-length transcripts 61 by using error correction and polishing pipelines [7]. Well-established PacBio-only based 62 pipelines such as IsoSeq [8,9] and IsoCon [10] often perform well on these tasks and hybrid 63 sequencing even outperforms these methods producing better transcriptome assemblies [11-64 13]. After assessing the best transcriptome assembly with tools such as rnaQUAST [14], 65 SQANTI [15], or by using multiple assemblies to improve gene structure annotation [16], an 66 additional challenge in transcriptomic studies is the feature identification and annotation 67 process. Initially, pipelines such as MAKER integrated trained ab initio gene predictions, 68 Expressed Sequence Tags (EST), and proteins to annotate genes from a given genome [17,18]. 69 In the same way, the gene prediction program AUGUSTUS accurately predicts genes using 70 supervised training of EST and proteins as external hints, including the use of short read RNA-71 seq alignments to improve final gene prediction [19,20]. Later, BRAKER1 pipeline was 72 developed, a short read RNA-seq genome annotation pipeline that combines AUGUSTUS and 73 GeneMark-ET, an unsupervised RNA-seq gene prediction tool [21,22]. Subsequently, 74 Even though IsoSeq displayed better assembly statistics than Illumina technology, 156 Illumina and merged sequencing technologies led to higher-quality assemblies assessed by the 157 number of completed BUSCOs found in the aves lineage (aves_odb10) (Figure 1C). This 158 result is expected since Illumina has better sequencing depth and quality than IsoSeq (~1% 159 versus ~11% sequencing error, respectively) [36]. Also, the merged strategy slightly improved 160 the completeness of the SCO transcriptome compared to the Illumina assembly (1%). Thus, 161 despite higher Ex90N50 values of IsoSeq assembly, we further selected the merged strategy to 162 annotate genes, because of its increased quality over Illumina and PacBio assemblies alone. 163 Our pipeline identified 19,690 reconciled genes including 4,292 candidate genes that are not 164 annotated on the chicken GRCg6a reference genome, from them 64% and 9% corresponding 165 to coding genes ( Figure 1D, left and right, respectively). At the level of transcripts, we 166 observed 61,940 reconciled and 5,201 non-annotated transcripts, where 76% and ~10% are 167 coding transcripts, respectively (see Figure 1E, left and right, respectively, and 168 Supplementary Table 1). Of note, a substantial number of transcripts not classified as either 169 coding or long non-coding RNAs composed missing transcripts (65%), arguing that these 170 transcripts survived the RNA quality control system removal from the cell [37,38] and could 171 correspond to small RNAs, incomplete gene models, and/or transcripts emanating from repeat 172 regions [39]. We aimed to classify the discovered long noncoding RNAs (lncRNAs) by 173 location and subtype, by using the FEELnc classification tool [34]. By location, concerning 174 neighboring genes, we found a significant proportion of exonic/intronic lncRNAs types (~11%, 175 Figure 1F). Also, by orientation, a significant proportion of all lncRNAs are divergent 176 lncRNAs (27%, Figure 1G). Overall, our results confirm that hybrid sequencing is beneficial 177 for a comprehensive and reconciliated characterization of a given transcriptome, which agrees 178 with a previous report [25]. In addition, our tool provides a way to streamline the annotation 179 process in a user-friendly manner.  to properly assemble the SSPO gene (SCO-spondin), one of the main secreted glycoproteins 196 from the SCO (Figure 2A). Also, SSPO related transcripts did not figure in the circular 197 consensus sequences (CCS) or in the high-or low-quality assembled transcripts from IsoSeq 198 (https://github.com/ben-lerch/IsoSeq-3.0).
Conversely, Illumina sequencing from 199 HH23/HH30 stages aligned to SSPO locus (see turquoise and blue colored tracks in Figure  200 2A, respectively). Combined Illumina-PacBio sequencing led to the assembly of four SSPO 201 isoforms of 106, 95, 89, and 25 exons, respectively (see the red-colored track and gene track 202 underneath in Figure 2A). The merged strategy led to the assembly of two SSPO isoforms of 203 106 exons, and 3 isoforms of 105, 17, and 25 exons, all of them classified to encode proteins 204 and not lncRNAs (see transcripts 1, 3, 2, 4, and 5 in Gene_PacBio_Illumina track from Figure  205 2A, respectively). Transcript 1 encodes a protein of 5270 amino acids with 98.88% of identity 206 with a previously deduced SSPO protein in chicken, derived from a cloned cDNA in SCO 207 (GenBank accession AJ 866919) [42]. The latter protein contains 5255 amino acids encoded 208 within 105 exons, lacking the first assembled exon of our reconstructed transcripts. Thus, the 209 merged strategy leads to a superior assembly consisting of five alternative isoforms (see green-210 colored numbers indicating the new exons in Figure 2A). Also, the use of only Illumina 211 assembly leads to an incomplete assembly of SSPO gene at the 5' end (see Illumina gene track 212 in Figure 2A, the track called "merged_illumina.gtf"). Regarding the latter, no degradation of TSEBRA and annotate_my_genomes, but not ab initio AUGUSTUS and SQANTI3 methods 232 correctly assembled all previously described exons from SSPO (n = 105) ( Figure 3A). Of 233 notice, annotate_my_genomes method assembled these 105 exons, including an additional 234 exon (hereafter exon 1) across SSPO isoforms ( Figure 3B). These preliminary results indicate 235 that our method can resolve more exons than the referred methods, however this result might 236 not necessarily imply a better assembly. 237 Therefore, we additionally annotated Mus musculus (mm10), Homo sapiens (hg38), 238 Danio rerio (danRer11) and Caenorhabditis elegans (ce11) RNA-seq datasets sequenced with 239 short and long reads, using the referred annotation methods (see Supplementary Table 4 for 240 sequencing datasets). We compared the gene annotation predictions from each method in GTF 241 format against the NCBI reference GTF from each genome (the latter considered as truth), 242 using gffcompare [44]. The evaluated parameters covered bases, exon, intron, intron-chain, 243

248
. We considered F1-score as the final measure of gene prediction accuracy for each method. In 11 all evaluated parameters, annotate_my_genomes using genome-guided StringTie GTF as input 250 outperform all methods in our sequencing dataset (galGal6) (Figure 3C, see asterisks). This 251 behaviour was also seen in Mus musculus, Homo sapiens, Danio rerio and Caenorhabditis 252 elegans, excepting at the intron level, where BRAKER2 outperform our method in three out of 253 four datasets ( Figure 3D, see asterisks). Of note, SQANTI3 outperformed our method in one 254 evaluated parameter (locus level) in one dataset (Danio rerio). annotate_my_genomes using 255 de novo GTF assembly produced with StringTie as input performed similarly as BRAKER2 or 256 TSEBRA in each dataset, sometimes surpassing BRAKER2/TSEBRA (see galGal6 in Figure  257 3C and mm10, hg38 in Figure 3D, respectively). 258 We also noticed CPU times were equal or inferior when annotate_my_genomes method 259 is employed, in comparison with BRAKER1, BRAKER2, or TSEBRA across the referred 260 Figure 3) Although SQANTI3 method was very fast, this 261 method only took as input the alignment of long reads in the form of a GTF assembly, not 262 considering short read processing, thus explaining the short runs. 263

RNA-seq datasets (Supplementary
In summary, with the availability of a good genome assembly including genome 264 annotation in GTF format, it is beneficial to run annotate_my_genomes using a genome-guided 265 StringTie GTF file as input when dealing with hybrid RNA-seq datasets. Also, in the absence 266 of a genome annotation file in GTF format, it is worth running annotate_my_genomes, using 267 de-novo StringTie GTF as input, since this method performs similar or sometimes better than 268 BRAKER2/TSEBRA. chicken proteins and of them, around 70% were present with high similarity between the 281 predicted and UniProt annotated proteins, demonstrating a good agreement between real and 282 predicted protein sequences derived from our assembly process (> 98%, Figure 4A). With this 283 schema, we mapped missing paralogs in NCBI galGal6 genome by selecting proteins between 284 90-100% identity with Gallus gallus UniProt proteins. Since we have the genomic positions of 285 the transcripts that encode for all these proteins, if two proteins have near 100% identity by 286 blast analysis and the correspondent transcripts map within a 500 kb window, these proteins 287 are considered as paralog candidates. Conversely, if the transcript maps to different loci 288 positions, we considered them as homologs [45,46]. Also, we examined if the transcripts that 289 were associated with novel proteins overlapped with loci of previously annotated genes that 290 were missing in the current annotation due to lack of evidence in the NCBI database. If that is 291 the case, we considered these proteins as isoforms from missing genes. 292 Additionally, we benchmarked our blast results with the use of eggNOG-mapper, a 293 method employing sequence homology search on a metagenomic scale [47,48]. The output 294 from eggNOG-mapper was intersected with the previous BLASTp results as described here: 295 https://github.com/cfarkas/annotate_my_genomes/wiki#5-annotate-and-identify-homologs-296 in-novel-proteins-from-transcriptome. With both methods, we confirmed a substantial amount 297 of novel coding genes encoding for endogenous retrovirus genes (ERVs) and genes containing 298 homology with Ribonuclease H domains, zinc finger domains (CCHC domain-containing 299 proteins), and olfactory genes, among others ( Figure 4B, Supplementary Table 3). Among 300 the latter, we mapped fifteen missing candidate genes with different unfinished annotation 301 status in NCBI database, including six novel homologous genes and novel paralog genes in 302 chicken, respectively. Of the missing genes, Ubiquitin Specific Peptidase 53 (USP53) was 303 absent in NCBI annotation but was present in the Ensembl annotation whereas Aminoadipate-304 Semialdehyde Dehydrogenase (AASDH) was missing in both databases ( Figure 4C, upper and 305 lower, respectively). We also confirmed the existence of an alpha macroglobulin paralog, 306 downstream to A2ML3 loci, located in chromosome 1 of galGal6 genome ( Figure 4D) and a 307 novel homolog of VCP gene, spanning half of an unmapped contig belonging to chromosome 308 Z (chrZ_NW_020109829v1_random, Figure 4E). Since the VCP gene maps to chromosome 309 Z, this novel homolog could be classified as a VCP paralog due to its close blast homology 310 with VCP proteins, but the proximity of this unplaced contig cannot be determined with respect 311 to the VCP gene. (Supplementary Table 3). 312 In summary, our ortholog/paralog assignations of non-annotated coding genes can help 313 to increase annotation of important missing genes and aid to reconcile current annotations 314 instead of choosing a single annotation tool from either the NCBI, Ensembl, and/or other 315 sources, a common practice in next-generation sequencing analysis [49]. Also, these 316 procedures can aid to identify novel ERVs, possibly encoding for functional proteins due to 317 their evolutionary conservation in vertebrates [50]. Here, we have developed and presented a hybrid RNA-seq annotation pipeline that 322 helps to increase genomic annotation and allows researchers to discover missing/homologous 323 genes by integrating previous genomic annotations in various animal genomes, providing a 324 reconciled annotation in GTF format. This pipeline can be used for any organism that has an 325 assembled genome and an NCBI available annotation in GTF format and relies on the use of 326 StringTie as a transcript assembler. By using previously well-known tools, this pipeline can 327 efficiently identify non annotated genes versus reconciled genes and distinguish between 328 coding and non-coding genes. We benchmarked our pipeline against well-known bioinformatic 329 genome annotation pipelines such as BRAKER1/2, TSEBRA and SQANTI3 across 330 transcriptomes from five different species. According to the F1-score, our method performed 331 equal or better than existing pipelines in terms of assembly quality. Moreover, F1-scores using 332 genome-guided GTF assemblies as input for our pipeline were scored with the highest F1 333 values. Therefore, if curated genome annotations are present for a given genome, it is beneficial 334 to run our pipeline, since our method reconcile the current gene annotation and include novel 335 loci. As proof of a concept, we fully assembled the chicken SSPO gene consisting of 106 exons 336 rather than the previously published 105 exons, demonstrating good functionality in well-337 annotated genomes such as the chicken genome. Our pipeline assembled five transcripts with 338 coding protein potential derived from the SSPO locus in the SCO. The assembly contained > 339 100 exon isoforms that are consistent with the presence of high molecular weight bands in the 340 SCO previously reported by western blot using anti SCO-spondin (350-300 kDa) as well as 341 lower bands ranging from 200 to 50 kD, probably corresponding to these smaller isoforms [28]. 342 At the time of writing of this manuscript, the protein sequence of chicken SSPO was recently 343 updated in NCBI (NCBI Reference Sequence: NM_001006351.3, October 25, 2021), the 344 sequence associated with the novel chicken genome assembly bGalGal1.mat.broiler.GRCg7b 345 (assembly accession: GCF_016699485.2). The SSPO protein from bGalGal1 assembly 346 contains the novel assembled exon 1 described in this manuscript, which was only detected 347 with annotate_my_genomes method. Therefore, this independent result supports the quality of 348 the SSPO transcripts assembled with our pipeline. 349 Since genome assemblies often update, this tool can aid in rapidly assigning genomic 350 coordinates to missing genes, by inputting the updated genome assembly and correspondent 351 annotation in the pipeline. This was the case of USP53 and AASDH genes, the latter was 352 missing in all genomic annotations since the galGal4 chicken genome assembly was released 353 in 2004 [51]. Also, we discovered a novel VCP homolog spanning half of an unmapped contig 354 belonging to chromosome Z. We thus encourage researchers in the transcriptomics field to 355 consider performing our novel assembly and re-annotation of RNA-seq data rather than using 356 a single GTF annotation file in their studies. Importantly, in SCO organ development we 357 discovered a myriad of divergent lncRNAs according to FEELnc lncRNA classification tool, 358 potentially important in the differentiation of neural stem cells [52]. Overall, we propose that 359 this pipeline will be a useful resource for obtaining a comprehensive view of the transcriptional 360 landscape in each study and will help researchers to characterize novel transcriptomes and 361 increase current genome annotations. The present work will have two major impacts on the research community. On one side, 366 our pipeline will facilitate the transcriptomic annotation of hybrid sequencing for research 367 without advanced coding skills. This pipeline is implemented as an easy-to-use package that 368 integrates gold standard methods associated with transcriptome annotation. On the other side, 369 our work advances our understanding of the chicken brain transcriptome by displaying an 370 updated annotation, which includes full-length transcripts with challenging structures to 371 assemble. We expect that our method will be useful for biologists interested in improving 372 transcriptome annotation on a wide range of species, tissue and research areas. As well, our 373 dataset will help to understand the development of specific brain structures providing a 374 transcriptomic resource that can be consulted by all the community.   benchmarking. The precision, recall, and their harmonic mean -the F1-score -as measures of 454 gene prediction accuracy were obtained by using gffcompare. 455

Homolog Assignments 456
To assess possible homologs in novel coding genes (cds), we blasted the novel predicted 457 proteins from these cds against the UniProt Gallus gallus proteome (taxid 9031) [66] with the 458 setting -max_hsps 1 -max_target_seqs 1 in blastp command [67]. Then, we parsed these results 459 and compared the genomic positions of all novel protein matches against the genomic positions 460 of proteome indexed in NCBI. If two matches with 90-100% homology were found within the 461 same loci (<0.5 Mb), we considered them as paralogs [45,46]. Otherwise, we considered these 462 matches as missing genes in the reference annotation. We also integrated to the previous results 463 the metagenome-level annotation of novel proteins using eggNOG-mapper ortholog 464 classification software [47,48]. All relevant commands to reproduce these analyses are 465 available here: https://github.com/cfarkas/annotate_my_genomes/wiki#5-annotate-and-466 identify-homologs-in-novel-proteins-from-transcriptome 467 468

Data and Code Availability 477
All computational steps to replicate the analysis performed in this paper are available 478 here: https://github.com/cfarkas/annotate_my_genomes. We provide on the GitHub page an 479 easy-to-install package of our pipeline that can be run on a modern laptop using Linux/Ubuntu for Mus musculus (mm10), Homo sapiens (hg38), Danio rerio (danRer11) and Caenorhabditis 552 elegans (ce11) genome RNA-seq datasets, respectively. The benchmarked methods were the 553 following: annotate_my_genomes (with or without genome guide) BRAKER2, TSEBRA, 554 AUGUSTUS Ab initio, and SQANTI3. Black asterisks indicate the best F1 scores per method. 555 The color scale indicates lower and higher F1 values with blue and yellow scales, respectively.   Please find attached our new version of the manuscript entitled 'Annotate_my_genomes: an easy-to-use pipeline to improve genome annotation and uncover neglected genes by hybrid RNA sequencing' that we would like to be considered for publication in GigaScience.
A previous version of this manuscript was sent in August 2021 (GIGA-D-21-00247), but on that occasion, the reviewers considered that it was necessary to improve the pipeline giving us several suggestions on this matter. We appreciate the revision made that largely improved our work. Now, we are sending a new version of our manuscript with all the reviewers' concerns fully addressed. One of the most important improvements we made to our pipeline involves the implementation as an Anaconda environment and as a Nextflow container workflow, providing different alternatives to the users, including reproducibility. On the other hand, we benchmarked our pipeline with different genome annotation workflows such as AUGUSTUS, BRAKER1, BRAKER2, the recent TSEBRA method, and PacBio-dedicated SQANTI3 workflow. Annotate_my_genomes, relying on StringTie assembler, often displayed equal or superior performance based on the F1-score against truth (please see Figure 3).
As stated in the initial submission, our study provides a user-friendly bioinformatic tool that performs genome-guided transcriptome annotation, using as input assembled transcripts from hybrid sequencing such PacBio and Illumina technology. This pipeline distinguishes between coding and long non-coding RNAs by the integration of several bioinformatic approaches, including gene reconciliation with previous annotations in Gene Transfer Format (GTF). Although the use of this pipeline was originally designed to annotate assembled transcripts from hybrid sequencing, it can also be applied to annotate assembled transcripts from mapped data derived from Illumina alone or other sequencing technologies. We first demonstrated the efficiency of our approach by correctly assembling and annotating all exons from the chicken SCO-spondin gene (containing more than 105 exons), a giant gene that consistently lacks coding information in NCBI or Ensembl databases. This gene encodes a protein of more than 5000 amino acids and its expression is highly conserved in Chordates and restricted to the sub-commissural organ, a brain gland related to different morphogenic events, such as the regulation of brain development and body axis alignment. We also successfully mapped hundreds of missing genes in the chicken reference annotations by homology assignments, including the characterization of USP53 and AASDH genes. As an example, the latter gene is missing in all genomic annotations since the chicken genome assembly was released in 2004 (galGal4). In this sense, our method not only allows us to annotate new transcripts but also reconcile current annotations providing a compressive result.
Below, we will proceed to detail the responses to all the comments and requests made by the reviewers: Reviewer #1: The manuscript "annotate_my_genomes: an easy-to-use pipeline to improve genome annotation and uncover neglected genes by hybrid RNA sequencing" by Farkas et. al. present a pipeline to annotate genomes with available genome annotations. The manuscript is well written in a clear and concise way. The authors present a methodology to integrate long-read RNA-Seq data with short-read to generate the assembly. The annotation pipeline uses also a

Cover Letter
Click here to access/download;Personal Cover;CoverLetter_GigaScience_22032022.docx

Universidad de Concepción
combination of Illumina (short-reads) and PacBio (long-reads) sequencing. The authors test the annotation pipeline with chicken brain data. They are able to identify and annotate genes that were missed in the public genome release for this organism. I consider that there are few issues that should be added to the manuscript.

Q1.
StringTie is used as assembler. I think that the manuscript should include a few comments on this step as there are other assembler like Trinity that can be used. There are still discussions about the bias that can be introduced in an assembly if the genome coordinates are used. I would like to know if the authors compared the StringTie nonreference based approach with the reference-based and they opinion about the bias of using annotated genomes files during the assembly.
Ans1. We agreed and thanked the reviewer's suggestion. In our pipeline, we initially used genome-guided mapping (using HISAT2/minimap2), producing a non-reference-based assembly as input for our pipeline (using StringTie). As suggested by this reviewer, we tested reference-based StringTie assemblies, obtaining equal or better F1-scores against truth GTF annotations from NCBI, when compared with non-reference StringTie based assemblies (please see Figure 3). Therefore, we recommend employing reference-based StringTie assemblies in our pipeline if good reference annotations are available. We expressed these observations in the manuscript. Concerning the use of Trinity, we didn't use this assembler because of their large memory requirements. The module Chrysalis on the Trinity pipeline requires more than 200Gb of RAM when a 28Gb Illumina sequencing dataset is assembled. Last, but not least, Trinity is a de-novo assembler for short reads which doesn't account for long read integration. The commands to create the assembly as we did on the present work are described on our pipeline's Wiki: https://github.com/cfarkas/annotate_my_genomes/wiki#ii-obtaining-stringtie-gtf-file-for-annotation Q2. The manuscript should include a comparison between the annotation pipeline and the already published pipelines mentioned in the Background section. There should be a discussion on advantages and disadvantages of the author's proposal compared with the existent pipelines.
Ans2. We agreed and thanked the reviewer's suggestion. We have included a new paragraph in the background section (see lines 62-86) mentioning published pipelines. Based on our benchmark, we discussed the advantages and disadvantages (such as sensitivity and precision calls) of the different pipelines (see lines 249-269 in Results section, and lines 329-336 in Discussion section, respectively).
Q3. There should be a discussion about running times and computer resources required by the pipeline. I would like to see, if possible, a time estimation for other organism like human and mouse.
Ans3. We agreed with this reviewer and performed these calculations. A detailed comparison among different pipelines for five organisms is included in Supplementary Figure 3. Q4. Often, RNA-Seq data contains contamination. How this pipeline can handle contaminated raw reads? If it require non-contaminated reads, then it should be clear in the manuscript Ans4. We fully agreed with the reviewer. We mention the Illumina preprocessing steps between lines 400 and 405 our pre-processing methodology for Illumina short reads. In this work, we choose the well-known tool fastp, however, it is important to mention that nowadays several tools are available for QC pre-processing; we expect that users choose the one that suits them better according to their dataset. Q5. I went over the source code. It is mainly complex BATCH scripts. Nowadays, computational biology data analysis pipelines are implemented using workflow languages and executed through workflow managers like CWLtool, Nextflow or snakepipes, just to mention a few. These workflow languages and manager allow portability, scalability and more important reproducibility of the results. The current source code shows a very complex set of